speech: persistent VkPipelineCache + qvac-speech-ggml hybrid packaging#7

Merged
gianni-cor merged 5 commits into tetherto:speech from GustavoA1604:speech
May 7, 2026
@GustavoA1604 GustavoA1604 commented May 7, 2026

Summary

Sync the speech branch with the chatterbox.cpp Vulkan optimizations and finalize the qvac-speech-ggml-* packaging convention agreed in #qvac (per-addon ggml prefix; speech-branch fork is the speech build).

Five commits, +343/-40, covering:

  • Per-consumer ggml library filename prefix wired through both the build and the runtime DL loader (GGML_LIB_OUTPUT_PREFIX → GGML_BACKEND_DL_PROJECT_PREFIX).
  • Persistent VkPipelineCache, opt-in via GGML_VK_PIPELINE_CACHE_DIR, plus a crash-safe eager flush. Recovers ~91% of the cold→warm shader-compile gap on drivers without an aggressive per-app system cache (Mesa/RADV, Android Adreno/Mali, fresh NVIDIA installs, containers). Tracks JIRA QVAC-17872.
  • Hybrid backend packaging cherry-pick from the 2026-01-30 branch, adapted from qvac-diffusion- to qvac-speech-. Lets the speech ggml ship as a static CPU core with MODULE GPU backends so an .aar/.apk/loose-DLL drop can dlopen Vulkan/OpenCL/CUDA at runtime without forcing the whole core dynamic.
  • Windows-correct pipeline-cache rename: switches both flush sites from C std::rename to std::filesystem::rename, which has POSIX overwrite semantics on Windows. Prevents the .pcache blob from getting frozen at its first-write size on the second and later saves on Windows.

Commits

  1. build: add GGML_LIB_OUTPUT_PREFIX option for per-consumer lib filename prefix
    Adds the GGML_LIB_OUTPUT_PREFIX cache var. When set, ggml libs land on disk as lib<prefix>ggml-*.{a,so,dll} and target_compile_definitions(ggml-base PRIVATE GGML_BACKEND_DL_PROJECT_PREFIX="<prefix>") is propagated so the runtime loader looks for the same names. CMake target names and the find_package(ggml CONFIG) package name are intentionally unchanged.

  2. vulkan: persistent VkPipelineCache, explicit opt-in via GGML_VK_PIPELINE_CACHE_DIR (QVAC-17872 round 1)
    Opt-in only; behaviour is byte-identical to upstream when GGML_VK_PIPELINE_CACHE_DIR is unset/empty. Cache file keyed on vendorID/deviceID/driverVersion. Save happens from ggml_vk_cleanup() (not ~vk_device_struct, which is unreliable at process exit because pipelines hold shared_ptr<vk_device_struct> ref cycles). Atomic save via tmp + rename.

  3. vulkan: crash-safe eager pipeline-cache flush (QVAC-17872 round 2)
    Flushes after every ggml_vk_load_shaders compile batch when the cache grew, so a process killed mid-graph doesn't lose freshly compiled pipelines. pipeline_cache_last_size book-keeping short-circuits both the eager flush and the cleanup-time flush on warm runs (cache-hit only). Without the short-circuit, the unconditional flush regressed warm-run wall time by ~90 ms on the chatterbox.cpp benchmark.

  4. cmake: support qvac hybrid backend packaging (cherry-pick from 2026-01-30)
    Cherry-pick of 512e1773 from the 2026-01-30 branch with the prefix swapped from qvac-diffusion- to qvac-speech-. Adds GGML_CPU_STATIC (CPU stays in the core .a, only GPU backends become MODULE shared libs), drops the GGML_BACKEND_DL requires BUILD_SHARED_LIBS check, makes ggml/ggml-base PIC under GGML_BACKEND_DL, and adds the Android filename-only dlopen fallback for flattened native dirs. Also pulls in the upstream-style unique_ptr conversion in ggml_backend_vk_reg_get_device (memory-leak fix; see design notes).

  5. vulkan: use std::filesystem for pipeline-cache path/rename (Windows-correct overwrite)
    Replaces the C std::rename / std::remove with std::filesystem::rename / std::filesystem::remove at both flush sites and builds pipeline_cache_path via std::filesystem::path joining. C rename on Windows fails if the destination exists, which meant the second-and-later flushes silently dropped on Windows.
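
The save path described in commits 2 and 5 (device-keyed file name, std::filesystem::path joining, tmp + rename) could look roughly like the sketch below. It is not the code in ggml-vulkan.cpp: the function names and the `ggml-vk-…` file-name pattern are assumptions made for illustration; only the `.pcache` extension and the vendorID/deviceID/driverVersion key come from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical naming scheme: one cache file per vendorID/deviceID/driverVersion.
inline fs::path pipeline_cache_path_sketch(const fs::path & dir,
                                           std::uint32_t vendor_id,
                                           std::uint32_t device_id,
                                           std::uint32_t driver_version) {
    char name[64];
    std::snprintf(name, sizeof(name), "ggml-vk-%08x-%08x-%08x.pcache",
                  static_cast<unsigned>(vendor_id),
                  static_cast<unsigned>(device_id),
                  static_cast<unsigned>(driver_version));
    return dir / fs::path(name);  // path joining instead of dir + "/" + fname
}

// Writes the blob to "<final>.tmp", then renames it over the final path.
// Returns false on any I/O error; an existing cache file is left untouched.
inline bool save_blob_atomically_sketch(const fs::path & final_path,
                                        const std::vector<std::uint8_t> & blob) {
    assert(!final_path.empty());
    fs::path tmp_path = final_path;
    tmp_path += ".tmp";

    std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char *>(blob.data()),
              static_cast<std::streamsize>(blob.size()));
    out.close();  // close before rename so the handle is released on Windows

    std::error_code ec;
    if (out.fail()) {
        fs::remove(tmp_path, ec);
        return false;
    }
    // std::filesystem::rename replaces an existing destination file.
    fs::rename(tmp_path, final_path, ec);
    if (ec) {
        fs::remove(tmp_path, ec);
        return false;
    }
    return true;
}
```

Calling `save_blob_atomically_sketch` twice on the same path with a larger blob the second time should leave the larger file on disk, which is the behaviour commit 5 restores on Windows.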

Design notes (preempting common review questions)

These are deliberate choices that look unusual at first glance — calling them out so a re-review doesn't re-litigate them.

Why is qvac-speech-ggml- hardcoded as the no-macro fallback in backend_filename_prefix()?

Per the team agreement in #qvac (Gianfranco / Juan A.), each *.cpp addon that vendors its own ggml fork carries its own filename prefix to avoid dlopen filename collisions when multiple ggml versions coexist in one process / .aar / .apk:

  • fabric/ggml → libqvac-ggml-*
  • whispercpp/ggml (this branch) → libqvac-speech-ggml-*
  • diffusion/ggml → libqvac-diffusion-ggml-*

The speech branch is not meant to be used as a generic upstream-equivalent ggml — it is the speech build. Hardcoding the qvac-speech-ggml- filename in the no-macro fallback closes a real footgun: a downstream that builds the speech branch with -DGGML_LIB_OUTPUT_PREFIX= (empty) but doesn't define GGML_BACKEND_DL_PROJECT_PREFIX would otherwise produce libggml-*.so files but a loader hunting for libqvac-speech-ggml-*.so. Aligning both defaults to qvac-speech- makes the branch internally consistent.

The GGML_BACKEND_DL_PROJECT_PREFIX macro path is preserved so any downstream that does want to override it (e.g. a future addon vendoring this branch under a different prefix) still can.
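
A minimal sketch of the macro-with-fallback shape described above, assuming POSIX-style `lib…` names and a `.so` extension only (Windows naming is left out). The `_sketch` helpers are hypothetical and are not the actual backend_filename_prefix() in the loader.

```cpp
#include <cassert>
#include <string>

// No-macro fallback: default to the speech-branch prefix rather than the
// bare upstream name. A build can still override it with
// -DGGML_BACKEND_DL_PROJECT_PREFIX="..." on the compiler command line.
#ifndef GGML_BACKEND_DL_PROJECT_PREFIX
#define GGML_BACKEND_DL_PROJECT_PREFIX "qvac-speech-"
#endif

// Hypothetical helper: prefix the runtime loader prepends to a backend name.
inline std::string backend_filename_prefix_sketch() {
    return std::string("lib") + GGML_BACKEND_DL_PROJECT_PREFIX + "ggml-";
}

// Hypothetical helper: full file name the loader would try to dlopen.
inline std::string backend_filename_sketch(const std::string & backend) {
    assert(!backend.empty());
    return backend_filename_prefix_sketch() + backend + ".so";
}
```

With the macro undefined, `backend_filename_sketch("vulkan")` yields `libqvac-speech-ggml-vulkan.so`, the name the Android test in the test plan expects to find.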

Why isn't GGML_LIB_OUTPUT_PREFIX baked into ggml-config.cmake.in?

Intentional — find_package(ggml CONFIG) consumers set it on their own side before find_package. The find_library(NAMES "${GGML_LIB_OUTPUT_PREFIX}ggml" ggml ...) form gives a clean fallback to the bare name for unprefixed builds. We preferred that over @PACKAGE_GGML_LIB_OUTPUT_PREFIX@ substitution because:

  1. The consumer side already needs to know the prefix to satisfy other constraints (e.g. setting GGML_BACKEND_DL_PROJECT_PREFIX for a custom build, naming their own portfile artefacts). Putting one half of the contract in the package config and the other on the consumer side fragmented the convention.
  2. vcpkg consumers of this branch declare the prefix explicitly via qvac-speech-ggml-style port names, so the package config never sees a "wrong" prefix in practice.
  3. The GGML_MAX_NAME pass-through in ggml-config.cmake.in follows the same opt-in shape — same rationale, kept symmetric.

Why is the unique_ptr refactor in ggml_backend_vk_reg_get_device bundled in the cmake commit?

It's a clean cherry-pick of the same upstream change from the 2026-01-30 branch and was part of 512e1773, not a separate commit there. The new code is a real memory-leak fix: the previous devices.push_back(new ggml_backend_device {...}) and new ggml_backend_vk_device_context raw allocations were never freed (the static std::vector<ggml_backend_dev_t> held only raw pointers and is never torn down). Splitting it out from this PR would mean a follow-up cherry-pick that diverges the file from 2026-01-30 for no benefit. Calling it out here so it's not lost.
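
The ownership change can be illustrated with stand-in types. This is a sketch of the pattern only, assuming a static registry that hands out non-owning pointers; `device_sketch`, `device_context_sketch` and `reg_get_device_sketch` are hypothetical names, not the ggml structs or the real function body.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-ins for the device context and device structs.
struct device_context_sketch {
    std::size_t index;
    std::string name;
};
struct device_sketch {
    std::unique_ptr<device_context_sketch> context;
};

// The static vector owns every device, so the allocations are released when
// the vector is destroyed instead of leaking as raw `new` results.
inline device_sketch * reg_get_device_sketch(std::size_t index, std::size_t device_count) {
    static std::vector<std::unique_ptr<device_sketch>> devices;
    static std::mutex mutex;
    static bool initialized = false;

    std::lock_guard<std::mutex> lock(mutex);
    if (!initialized) {
        for (std::size_t i = 0; i < device_count; ++i) {
            auto ctx = std::make_unique<device_context_sketch>();
            ctx->index = i;
            ctx->name  = "Vulkan" + std::to_string(i);

            auto dev = std::make_unique<device_sketch>();
            dev->context = std::move(ctx);
            devices.push_back(std::move(dev));
        }
        assert(devices.size() == device_count);
        initialized = true;
    }
    // Callers receive a non-owning pointer; ownership stays with the vector.
    return index < devices.size() ? devices[index].get() : nullptr;
}
```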

Why is the OpenCL qvac-parakeet patch: comment removed?

The comment described why ggml_backend_opencl_init returns nullptr instead of asserting on a zero-device list — but the behavior (the null-device guard) is preserved. The cherry-pick from 2026-01-30 additionally hardens ggml_backend_opencl_reg_device_get to also return nullptr when no devices exist, instead of asserting; with the guard now in two places, the original single-site comment was stale. The functional contract (ggml_backend_opencl_init may return nullptr and callers must fall back to CPU) is unchanged.
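
From the caller's side, the contract amounts to a null check followed by a CPU fallback. The sketch below uses hypothetical stand-ins (`opencl_init_sketch`, `cpu_init_sketch`) in place of the real ggml entry points, so it shows only the control flow.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a backend handle.
struct backend_sketch {
    std::string name;
};

// Stand-in for ggml_backend_opencl_init(): returns nullptr, rather than
// asserting, when no usable device is visible.
inline backend_sketch * opencl_init_sketch(int usable_devices) {
    static backend_sketch opencl{"OpenCL"};
    return usable_devices > 0 ? &opencl : nullptr;
}

// Stand-in for CPU backend creation, assumed to always succeed here.
inline backend_sketch * cpu_init_sketch() {
    static backend_sketch cpu{"CPU"};
    return &cpu;
}

// Caller-side contract: try the GPU backend, fall back to CPU on nullptr.
inline backend_sketch * pick_backend_sketch(int usable_opencl_devices) {
    if (backend_sketch * gpu = opencl_init_sketch(usable_opencl_devices)) {
        return gpu;
    }
    backend_sketch * cpu = cpu_init_sketch();
    assert(cpu != nullptr);
    return cpu;
}
```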

Why are the two pipeline-cache flush paths near-duplicates?

The cleanup-time flush in ggml_vk_save_pipeline_cache and the eager flush in ggml_vk_load_shaders look similar but differ in one important way: the eager path requires growth (blob.size() > pipeline_cache_last_size) before writing, whereas the cleanup path writes when size differs at all. Folding them into a single save(require_growth) helper saves ~10 lines but couples two call sites with subtly different correctness invariants. Left as-is for now; happy to refactor in a follow-up if a third call site shows up.
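
The difference between the two predicates can be sketched as follows. `pipeline_cache_tracker_sketch` is a hypothetical wrapper around the pipeline_cache_last_size book-keeping, written only to make the two conditions concrete.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper around the pipeline_cache_last_size book-keeping.
struct pipeline_cache_tracker_sketch {
    std::size_t last_size = 0;  // size of the blob last written to (or read from) disk

    // Eager path (after a shader compile batch): write only if the cache grew.
    bool should_flush_eager(std::size_t blob_size) const {
        return blob_size > last_size;
    }

    // Cleanup path: write whenever the size differs at all.
    bool should_flush_cleanup(std::size_t blob_size) const {
        return blob_size != last_size;
    }

    // Record a completed write so later warm-run checks short-circuit.
    void mark_flushed(std::size_t blob_size) {
        assert(blob_size != last_size);
        last_size = blob_size;
    }
};
```

On a warm run where the blob size equals `last_size`, both predicates are false and no disk write happens, which is the short-circuit from commit 3.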

Why is getenv("GGML_VK_PIPELINE_CACHE_DIR") safe?

Called once per device init, in ggml_vk_get_device, before any worker threads exist. The result is captured into device->pipeline_cache_path and never re-read. No thread-safety concern.
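
A sketch of the read-once pattern, treating an unset and an empty variable the same way (caching disabled). The helper names are hypothetical; in the PR the result is stored in device->pipeline_cache_path.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Maps the raw environment value to a cache directory.
// An empty result means the persistent cache stays disabled.
inline std::string cache_dir_from_value_sketch(const char * value) {
    if (value == nullptr || value[0] == '\0') {
        return std::string();
    }
    return std::string(value);
}

// Intended to be called once during device init, before worker threads
// exist; the caller stores the copy and never consults the environment again.
inline std::string read_pipeline_cache_dir_sketch() {
    return cache_dir_from_value_sketch(std::getenv("GGML_VK_PIPELINE_CACHE_DIR"));
}
```

Copying the value into a std::string right away also avoids holding on to the pointer returned by getenv, which later environment changes may invalidate.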

Why no auto-discovery of $XDG_CACHE_HOME / $HOME?

ggml is a library distributed through package managers (vcpkg) and consumed by applications that should decide whether and where to persist Vulkan artefacts. Writing to the user's home directory without being asked is a side effect library consumers cannot see from the API surface. Apps that want default-on caching can set the env var in their bootstrap.

Test plan

  • Linux x86_64, Vulkan (RADV, Mesa): cold→warm chatterbox.cpp benchmark, env var set → ~91% of the cold→warm compile gap recovered. Env var unset → timing matches upstream.
  • Linux x86_64, no env var: behaviour byte-identical to upstream (no .pcache writes, device->pipeline_cache == VK_NULL_HANDLE, createComputePipeline takes VK_NULL_HANDLE).
  • macOS, MoltenVK: env var set → cache file written and re-loaded across runs.
  • Windows x86_64, NVIDIA: with the commit 5 fix (std::filesystem::rename), .pcache size grows across multiple eager flushes and the cleanup flush in the same process; without it, the second flush is silently dropped.
  • Android arm64, Adreno: hybrid CPU-static + Vulkan/OpenCL MODULE build links, dlopen finds libqvac-speech-ggml-vulkan.so from the flattened APK native dir via the new filename-only fallback.
  • OpenCL fallback: zero-device host (no Adreno) → ggml_backend_opencl_init() returns nullptr, ggml_backend_opencl_reg_device_get() returns nullptr instead of asserting; caller falls back to CPU.
  • vcpkg consumer build of qvac-speech-ggml: find_package(ggml CONFIG) resolves libqvac-speech-ggml.a / libqvac-speech-ggml-base.a and the MODULE GPU backends.
  • Existing ggml unit tests: pass with GGML_LIB_OUTPUT_PREFIX=qvac-speech- (default on this branch) and with GGML_LIB_OUTPUT_PREFIX= (explicit empty).

GustavoA1604 and others added 4 commits May 5, 2026 13:29
…INE_CACHE_DIR

Adds an opt-in persistent shader cache to ggml-vulkan.  Enabled only
when the caller sets GGML_VK_PIPELINE_CACHE_DIR to a non-empty path;
when unset or empty behaviour is byte-identical to upstream ggml-vulkan.

No auto-discovery of $XDG_CACHE_HOME or $HOME.  ggml is a library
distributed through package managers (vcpkg) and consumed by
applications that should decide whether and where to persist Vulkan
artefacts.  Writing to the user's home directory without being asked
is a side effect library consumers cannot see from the API surface.

When enabled, createPipelineCache is seeded from the path at init and
getPipelineCacheData is written back from ggml_vk_cleanup() (not
~vk_device_struct which is unreliable at process exit due to
shared_ptr ref cycles).  File keyed on vendorID/deviceID/driverVersion;
Vulkan validates the blob header and silently ignores stale data if the
shader bundle or driver changed.  Atomic save via tmp+rename.

Recovers ~91% of the cold->warm shader-compile gap on the first warm
run on drivers without an aggressive per-app system cache (Mesa/RADV,
Android Adreno/Mali, fresh NVIDIA installs, containers).

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8
(QVAC-17872, round-1).

Co-authored-by: Cursor <cursoragent@cursor.com>

Stacks on the previous patch.  Writes back the on-disk pipeline-cache
blob after every ggml_vk_load_shaders compile batch instead of only at
ggml_vk_cleanup() time, so a process killed mid-graph (SIGKILL,
abort, OS shutdown) doesn't lose the freshly compiled pipelines.

Adds pipeline_cache_last_size book-keeping so warm runs short-circuit
the disk write: the eager path only flushes when the cache actually
grew (blob.size() > last_size), and the cleanup path skips when size
matches last_size.  This avoided a +90 ms WALL regression measured
during dev when the flush was unconditional.

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8
(QVAC-17872, round-2).

Co-authored-by: Cursor <cursoragent@cursor.com>

…1-30)

Cherry-pick of 512e177 from the 2026-01-30 branch, with the lib filename
prefix swapped from qvac-diffusion- to qvac-speech-:

- drop the GGML_BACKEND_DL requires BUILD_SHARED_LIBS check; static ggml
  core now coexists with MODULE GPU backends when GGML_CPU_STATIC=ON.
- ggml_add_backend_library skips MODULE for ggml-cpu-* when GGML_CPU_STATIC,
  so CPU stays in the core .a and only Vulkan/OpenCL/CUDA become .so.
- ggml/ggml-base get POSITION_INDEPENDENT_CODE=ON when GGML_BACKEND_DL is
  set, so MODULE backends can link the static core.
- ggml gets GGML_USE_CPU compile-define when GGML_CPU_STATIC.
- backend_filename_prefix() defaults to libqvac-speech-ggml- (matches the
  GGML_LIB_OUTPUT_PREFIX default on this branch).
- ggml-config.cmake.in handles the hybrid mode: exports the static CPU
  variant target while leaving GPU backends to ggml_backend_load_best at
  runtime.
- ggml_backend_opencl_init keeps the speech-branch's null-device guard
  (drop-clean fallback when ggml-opencl rejects all visible devices).

GustavoA1604 changed the title from "Update speech branch with chatterbox vulkan optimizations" to "speech: persistent VkPipelineCache + qvac-speech-ggml hybrid packaging" on May 7, 2026

…orrect overwrite)

Two fixes on top of QVAC-17872 round-1/round-2:

1. Replace C `std::rename` / `std::remove` with `std::filesystem::rename` /
   `std::filesystem::remove` at both flush sites (cleanup-time flush in
   ggml_vk_save_pipeline_cache, and the eager flush in ggml_vk_load_shaders).
   The C runtime `rename` on Windows fails when the destination already
   exists (per MSDN), which meant the second-and-later saves of the
   .pcache blob would silently fail and the on-disk cache would never
   advance past its initial size on Windows.  std::filesystem::rename has
   POSIX overwrite semantics on every platform we target.

2. Build pipeline_cache_path with std::filesystem::path joining instead
   of `dir + "/" + fname` string concatenation.  Avoids mixed-separator
   surprises if a caller passes a backslash-terminated dir on Windows.

Behaviour-equivalent to round-2 on Linux/macOS; Windows now actually
persists subsequent flushes instead of dropping them.

Co-authored-by: Cursor <cursoragent@cursor.com>
4 participants